The reproducibility and predictivity of radiomic features extracted from dynamic contrast-enhanced computed tomography of hepatocellular carcinoma

Purpose To assess the reproducibility of radiomic features (RFs) extracted from dynamic contrast-enhanced computed tomography (DCE-CT) scans of patients diagnosed with hepatocellular carcinoma (HCC) with regard to inter-observer variability and acquisition timing after contrast injection. The ability of reproducible RFs to discriminate between degrees of HCC differentiation is also investigated.
Methods We analyzed a set of DCE-CT scans of 39 patients diagnosed with HCC. Two radiologists independently segmented the scans, and RFs were extracted from each sequence of the DCE-CT scans. The same lesion was segmented across the DCE-CT sequences of each patient’s scan. From each lesion, 127 commonly used RFs were extracted. The reproducibility of RFs was assessed with regard to (i) inter-observer variability, by evaluating the reproducibility of RFs between the two radiologists; and (ii) timing of acquisition following contrast injection (inter- and intra-imaging phase). Reproducibility was assessed using the concordance correlation coefficient (CCC), with a cut-off value of 0.90. Reproducible RFs were used to build XGBoost models classifying the degree of HCC differentiation.
Results Inter-observer analyses across the different contrast-enhancement phases showed that the number of reproducible RFs was 29 (22.8%), 52 (40.9%), and 36 (28.3%) for the non-contrast enhanced, late arterial, and portal venous phases, respectively. Intra- and inter-sequence analyses revealed that the number of reproducible RFs ranged between 1 (0.8%) and 47 (37%) and was inversely related to the time interval between the sequences. XGBoost models built using the reproducible RFs in each phase showed high predictive ability for the degree of HCC tumor differentiation.
Conclusions The reproducibility of many RFs was significantly impacted by inter-observer variability, and a larger number of RFs were impacted by the difference in the time of acquisition after contrast injection. Our findings highlight the need for quality assessment to ensure that scans are analyzed in the same physiologic imaging phase in quantitative imaging studies, or that phase-wide reproducible RFs are selected. Overall, the study emphasizes the importance of reproducibility and quality control when using RFs as biomarkers for clinical applications.


Introduction
Radiomics is an emerging translational field that aims to extract and analyze data from medical images, using quantitative image features known as radiomic features (RFs), to support evidence-based clinical decision-making [1][2][3]. Machine learning models built from RFs have a wide range of clinical applications, including predicting cancer prognosis and predicting aortic dissection [4,5]. However, for these models to have greater legitimacy in clinical practice, the RFs from which they are built must be reproducible under a wide variety of factors related to image acquisition and processing [6][7][8][9]. For instance, a feature that is meaningful in one dataset may not be in another due to its sensitivity to acquisition settings (e.g., scanner manufacturer, scanning technique, and reconstruction parameters). As a result, the reproducibility of RFs has been extensively studied in human cohorts covering a variety of pathologies, as well as in phantom data [10][11][12][13][14][15][16][17][18].
Despite this, few studies have examined the reproducibility of RFs across and within different CT contrast-enhancement phases, mainly due to the lack of dynamic contrast-enhanced computed tomography (DCE-CT) images for this purpose. DCE-CT is typically used in the diagnosis and characterization of primary liver lesions, such as hepatocellular carcinoma (HCC) [19][20][21]. DCE-CT scans are taken at different time points as the contrast travels through various organs and can be clinically classified into three contrast enhancement phases (arterial, portal venous, and delayed phase) based on the LI-RADS 2018 criteria [22]. Because of the sensitivity of RFs, the accuracy and validity of models built using features extracted from CT images acquired in different phases can be impacted [23][24][25]. There is therefore a need to study their reproducibility across and within all the contrast enhancement phases. However, the literature on this topic is scarce, particularly in the imaging of liver lesions. Since HCC lesions show different characteristics in different imaging phases [26], biologically meaningful RFs could potentially have unique measurement values across the different phases [15,27,28]. To date, no study has evaluated the reproducibility of RFs within time windows in each contrast enhancement phase.
In this study, we present a unique dataset of DCE-CT scans from HCC patients. Our aims are: (i) to investigate the effects of differences in lesion segmentation on the reproducibility of RFs, and (ii) to assess the reproducibility of RFs within and across contrast enhancement phases, namely the non-contrast enhanced (NCE), late arterial (L-AP), and portal venous (PVP) phases. Ultimately, the goal is to guide robust analysis of RFs extracted from contrast-enhanced CT scans.

Patient data
Completely de-identified DCE-CT scans of 68 patients who underwent liver lesion assessment were retrospectively collected at a single medical center, with institutional review board approval. The inclusion criteria were: (i) pathologically proven HCC; (ii) absence of artifacts in the scans. Patients with pathologic diagnoses other than HCC (n = 17), patients with scans containing artifacts (n = 9), and those with missing sequences (n = 3) were excluded. This resulted in a total of 39 patients included for analysis in this study (Fig 1). All scans were acquired prior to the start of treatment. The study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board of Sun Yat-Sen University Cancer Center (SYSUCC) (protocol code 510060, approved on November 9th, 2022). Informed consent was waived by the Institutional Review Board of SYSUCC. The data were accessed for research purposes on the 11th of January 2023.

Imaging data, segmentation, and RFs extraction
The DCE-CT scans were acquired using a TOSHIBA Aquilion scanner. Each sequence scan included four slices, with time intervals between consecutive DCE sequences of 1-2 seconds. Scanning of the patients commenced immediately after contrast injection. The number of DCE-CT sequences per patient ranged between 36 and 42. The acquisition and reconstruction parameters for the included DCE-CT scans are presented in Table 1.
The volumes of interest (VOIs) of HCCs were independently segmented by two radiologists (QW and PG, with four and five years of experience in abdominal imaging, respectively) using an integrated tumor segmentation tool customized from the open-source Weasis platform [29]. Each radiologist segmented the VOIs on the sequence where the tumor was most visible. The segmentations were then automatically propagated to the remaining sequences, and further reviewed by each radiologist to ensure correct and consistent lesion segmentation across all sequences. RFs were extracted from the VOIs using in-house software. A set of 127 RFs was extracted from each lesion, derived from different feature classes, including 'First Order Statistics', 'Sigmoid Feature', 'Discrete Wavelet Transform', 'Edge Frequency', 'Fractal Dimension', 'GTDM', 'Gabor', 'LAW filter', 'LOG feature', 'Run Length', 'Spatial correlation', and 'GLCM', to characterize image patterns as comprehensively as possible. More details about feature class definitions and implementation can be found in our previous work [10]. No image resampling was performed, and RFs were extracted with the bin width set to 25 Hounsfield units. The extracted RFs are provided in S1 File.
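To make the bin-width setting concrete, the following minimal sketch shows one common fixed-bin-width discretization of Hounsfield unit (HU) values. The exact convention used by the in-house software is not described here, so the formula, the function name `discretize_fixed_width`, and the sample values are assumptions chosen purely for illustration.

```python
import math


def discretize_fixed_width(hu_values, bin_width=25.0):
    """Map HU values to 1-based gray-level bins of fixed width.

    Assumed convention (not confirmed by the study):
    bin = floor(value / width) - floor(min / width) + 1,
    so the lowest intensity in the VOI always falls in bin 1.
    """
    if not hu_values:
        raise ValueError("hu_values must not be empty")
    offset = math.floor(min(hu_values) / bin_width)
    return [math.floor(v / bin_width) - offset + 1 for v in hu_values]


# Hypothetical voxel intensities inside a VOI (illustrative only).
voxels = [-10, 0, 24, 25, 60, 110]
bins = discretize_fixed_width(voxels, bin_width=25.0)
print(bins)
```

With a fixed width, the number of gray levels depends on the intensity range inside the VOI, which is one reason texture features can change as contrast enhancement alters that range.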

Analysis of inter-observer variability
Two radiologists (QW and PG) independently assigned one of the labels (NCE, L-AP, or PVP) to each of the DCE sequences and segmented the HCC lesions (Fig 2). The labels were based on the LI-RADS Version 2018 criteria for defining dynamic CT phases, as well as other commonly used clinical criteria [22,30]. Disagreements over the labels were reviewed and discussed with a third radiologist (YC, with six years of experience), and a consensus was reached on all labels. The similarity between the two radiologists' segmentations was assessed using the Dice similarity coefficient (DSC) [31]. The agreement in feature values extracted from all the sequences between the radiologists' segmentations was assessed as one of the primary endpoints.
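As an illustration of the segmentation-overlap measure, the sketch below computes the DSC, defined as twice the number of shared voxels divided by the total number of voxels in both masks. Representing each mask as a set of voxel coordinates and the name `dice_coefficient` are assumptions made for this example; the toy masks are not study data.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Each mask is an iterable of voxel coordinates, e.g. (slice, row, col).
    Returns a value in [0, 1]; 1 means identical masks.
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy masks for two readers (illustrative only).
reader_1 = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
reader_2 = {(0, 0, 1), (0, 1, 1), (0, 2, 1)}
print(round(dice_coefficient(reader_1, reader_2), 3))  # 2 * 2 / (4 + 3) = 0.571
```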

Analysis of effects of phase variability
To assess the agreement in RF values within different phases for each radiologist independently, a different approach was used. The number of sequences within each phase needed to be the same for all the patients. Therefore, the fewest number of sequences available per phase across all patients was identified and set as the number of sequences to be included. Several patients had only two NCE sequences; therefore, only the first two NCE sequences were included for all the patients. For the L-AP phase, seven sequences were selected for each patient: the first two, the middle three, and the last two sequences. Similarly, seven PVP sequences were included for each patient: the first two, the middle three, and the last two sequences. Pairwise comparisons were performed across all 16 selected sequences. The concordance correlation coefficient (CCC) with a cutoff of 0.90 was used to identify the within-phase reproducible RFs. Following the identification of reproducible RFs, features that were found to be highly correlated were removed. High correlation was defined as Spearman's R > 0.90. When two RFs were found to be highly correlated, the one with the higher average correlation with the remaining RFs was removed. The study workflow is depicted in Fig 3.
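The redundancy-removal step can be sketched as follows. This is a simplified greedy version written for illustration, not the code used in the study: it assumes the threshold is applied to the absolute value of Spearman's R, and the helper names (`average_ranks`, `spearman`, `drop_correlated`) and feature values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's R: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_correlated(features, threshold=0.90):
    """Greedily remove one member of each highly correlated pair.

    `features` maps feature name -> list of values (one per lesion).
    Assumption: |R| is compared against the threshold.
    """
    kept = list(features)
    corr = {(a, b): abs(spearman(features[a], features[b]))
            for a in kept for b in kept if a != b}

    def mean_corr(name):
        others = [g for g in kept if g != name]
        return sum(corr[name, g] for g in others) / len(others)

    while True:
        pairs = [(corr[a, b], a, b) for i, a in enumerate(kept)
                 for b in kept[i + 1:] if corr[a, b] > threshold]
        if not pairs:
            return kept
        _, a, b = max(pairs)  # handle the most correlated pair first
        # Drop the member with the higher average correlation to the rest.
        kept.remove(a if mean_corr(a) >= mean_corr(b) else b)


# Hypothetical feature values for six lesions (illustrative only).
example = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [1, 2, 3, 4, 6, 5],
    "f3": [2, 1, 4, 3, 6, 5],
}
print(drop_correlated(example))
```

In this toy example only `f1` and `f2` exceed the threshold, and `f2` is dropped because its average correlation with the other features is higher.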

Statistical analyses
All statistical analyses were performed in the R language using RStudio version 2022.02.0 [32,33].
To assess the agreement in RF values between the two radiologists, the CCC was used [34].
Pairwise comparisons were made across the included scans. RF values extracted from all segmented lesions by the two radiologists were compared once using all the data, and once within each contrast enhancement phase. RFs with CCC values equal to or higher than 0.90 were considered reproducible [35].
To assess the association between the reproducible RFs and the degree of histologic differentiation of the HCC lesions, the Wilcoxon rank-sum test was used to compare RF values between well to moderately differentiated tumors and moderately to poorly differentiated tumors. The significance level was set at 0.05.
Machine learning was used to develop classification models using the reproducible RFs. For this analysis, the data were first split into a training set of 29 patients (74%) and a testing set of 10 patients (26%). The outcomes in the training set were balanced using the synthetic minority oversampling technique (SMOTE). Following that, if the number of reproducible features was less than 3, all were used for building the final model. If the number of reproducible RFs exceeded 3, recursive feature elimination with treebag functions and 5-fold cross-validation was used to select the most important RFs, with a maximum of 3. The XGBoost algorithm was used to develop the classification model. The model was validated on the test set, and the AUC, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV) were used to assess the models' performance.
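For readers unfamiliar with the CCC, the sketch below computes Lin's concordance correlation coefficient, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), and applies the 0.90 cut-off to a dictionary of features. It is a minimal illustration under stated assumptions rather than the R code used in the study: population (1/n) variances are assumed, and the function names and feature values are hypothetical.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = sxx + syy + (mx - my) ** 2
    return float("nan") if denom == 0 else 2 * sxy / denom


def reproducible_features(values_a, values_b, cutoff=0.90):
    """Names of features whose CCC between two measurement sets is >= cutoff."""
    return [name for name in values_a
            if concordance_correlation(values_a[name], values_b[name]) >= cutoff]


# Hypothetical values from two readers on four lesions (illustrative only).
reader_1 = {"stable": [1.0, 2.0, 3.0, 4.0], "shifted": [1.0, 2.0, 3.0, 4.0]}
reader_2 = {"stable": [1.1, 2.0, 2.9, 4.0], "shifted": [2.0, 3.0, 4.0, 5.0]}
print(reproducible_features(reader_1, reader_2))
```

The "shifted" feature is perfectly linearly correlated between readers yet falls below the cut-off (CCC of about 0.71), because the CCC penalizes systematic offsets as well as scatter.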
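The test-set performance measures listed above can be computed from predicted labels and scores as in the following sketch. The labels, scores, 0.5 decision threshold, and function names are illustrative assumptions and do not reproduce any result from this study; the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up test set of 10 cases (illustrative only, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_metrics(y_true, y_pred))
print(round(auc(y_true, scores), 3))
```

With a test set of this size, each metric moves in large steps (a single misclassified case changes sensitivity by 25 percentage points in this toy example), which is worth keeping in mind when interpreting small-cohort results.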

Inter-observer variability
The assessment of segmentation similarity between the radiologists showed a DSC of 0.79. Among the extracted features, 29 (22.8%) RFs were concordant across the NCE sequences; 52 (40.9%) RFs were concordant across the L-AP sequences; and 36 (28.3%) RFs were concordant across the PVP sequences (Table 2).

Clinical correlations
Descriptive statistics. The association between the reproducible RFs and the degree of histologic differentiation of HCC was assessed for each reproducible RF within each phase per radiologist.
The descriptive statistics of the reproducible RFs and the Wilcoxon p-values for radiologists 1 and 2 are presented in Tables 5 and 6, respectively.
Classification models: L-AP. For radiologist 1, the selected RFs were: "Gabor Max Z" and "Gabor sum Z". For radiologist 2, the selected RFs were: "LoG Z Entropy", "LoG Z Uniformity", and "LoG Z MGI". The performance of the models is presented in Fig 7.
Classification models: PVP. For radiologist 1, the selected RFs were: "DWF Z L", "DWF Z LL", and "LoG Z Uniformity". For radiologist 2, the selected RFs were: "Gabor Median Z", "Gabor sum Z", and "Gabor Mean Z". The performance of the models is presented in Fig 8.

Discussion
In this study, using our HCC DCE-CT dataset, we assessed the effects of interobserver (intersegmentation) variability on the reproducibility of RFs, as well as the agreement in RF values within each of the three clinically used imaging phases. Uniquely, we analyzed DCE-CT scans, which provide sequential CT images at specific time intervals (on the order of seconds). Thus, we were able to analyze the reproducibility of RFs within the window of different contrast enhancement phases, which was not previously investigated. As expected, our results showed that the differences in RF values attributed to variations in imaging timing/sequences were more pronounced than the interobserver effects. Between 22.8% and 40.9% of the extracted RFs were reproducible between the two radiologists, depending on the phase, while the number of reproducible RFs for the same radiologist varied between 1% and 37% depending on the pairs of DCE-CT sequences compared. The removal of the highly correlated RFs further reduced the number of reproducible RFs. Hence, segmentation and timing variabilities are important factors that significantly affect the reproducibility of RFs. These findings align with previous studies that assessed the effects of inter-observer variability and clinically used imaging phases on the reproducibility of RFs [13,15][36][37][38][39][40].
The findings of this study are consistent with previous research that, in a more limited manner, investigated the effects of contrast enhancement on the reproducibility of RFs [27,40,41]. A prior study investigating the reproducibility of HCC RFs across the imaging phases (arterial and PVP) reported that 25% of the original RFs extracted with the Pyradiomics toolbox were reproducible [15]. Another published study examining the effects of variations in imaging phase on the reproducibility and predictive power of renal cell carcinoma RFs across the NCE, AP, and PVP scans reported a maximum agreement of 22.4% between the NCE and PVP scans, while the PVP RFs were found to be the least predictive of overall survival in renal cell carcinoma patients [24]. These studies suggest that tumor type and site also influence the effects of contrast enhancement on the reproducibility of RFs, since different sets of features were reported across studies that investigated different tumor types and sites, in addition to the differences in type and make of imaging hardware. Unlike prior studies, our data allowed us to investigate within-phase reproducibility. Our results demonstrate that even subtle changes in acquisition time can significantly affect the reproducibility of the extracted RFs. Different phases are acquired to study tumor changes, such as intensity wash-in/wash-out, and they should be analyzed separately in quantitative image analysis. Yet, these were sometimes analyzed together in the literature. The majority of the RFs in this experiment had a very narrow window of reproducibility across the DCE-CT sequences. This confirms the need for care and caution when investigating RFs extracted from scans acquired at even slightly different points of contrast enhancement. This is critical for radiomics analyses since most imaging cohorts, whether publicly or privately available, are acquired in different contrast enhancement phases. We strongly recommend the inclusion of a phase determination step in radiomics studies analyzing contrast-enhanced imaging datasets.
Interestingly, our results showed that the numbers of reproducible RFs within each phase and across all pairs differed per radiologist. The highest number of reproducible RFs was observed in the PVP comparisons for both radiologists. While this result could be due to the different numbers of comparisons available for each phase, it might also relate to the appearance of HCC lesions in different imaging phases. Nevertheless, when considering the agreement between radiologists, the L-AP phase had the highest number of reproducible noncorrelated RFs, which was also the phase where radiologists performed the first segmentation. It is worth noting that while these RFs have the highest reproducibility, their predictive value must also be considered when selecting the most suitable phase for HCC radiomic studies. Our results reiterate the need for proper quality and reproducibility assessments before performing radiomics analyses.
When considering interobserver variability, our analysis revealed a high agreement in RF values between radiologists in less than a third of the extracted RFs.The number of RFs varied slightly when each phase was assessed separately, with PVP segmentations showing the highest number of reproducible RFs.A similar pattern was observed for intra-phase variability; the highest concordance in RF values was observed across the PVP comparisons.
The evaluation of reproducible RFs within each phase, and for each radiologist, demonstrated a high discriminative ability between the degrees of HCC differentiation in our dataset.These RFs, which intuitively describe the texture of the lesions, thus meet both key criteria for biomarkers: reproducibility and predictivity.
While we carefully designed and executed the statistical analyses in this study, several limitations remain. First, the number of sequences per phase varied among the included patients, which we addressed by standardizing the number of sequences per patient for imaging phase analyses. The scans were selected based on their position within the phase sequences. The different number of within-phase comparisons most likely affects the final number of reproducible RFs per phase. Second, different vendors and imaging parameters were used to acquire the scans, which impacts the reproducibility of RFs. Although the comparisons in this experiment were longitudinal, the rank of patients could be variably affected, ultimately impacting the calculated CCC values. The lack of data prevented the analysis of the effects of variations in imaging acquisition and reconstruction parameters on RFs. While the number of patients included in this study was limited to 39, CCC values are robust in a sample size as small as 10 patients. In addition, previous studies investigating the reproducibility of RFs used a similar number of patients [10,28][42][43][44][45][46], including studies on HCC radiomics [47][48][49][50][51]. Lastly, although the reproducible RFs were found to be predictive of the degree of HCC differentiation, the limited number of patients constrains the generalizability of this finding. However, this study serves as a pilot, especially since previous radiomics studies investigating the association between RFs and HCC differentiation have primarily focused on magnetic resonance imaging features.
In conclusion, our results indicate that the majority of RFs are sensitive to variations in the time of acquisition following the injection of a contrast agent. Future radiomics studies should analyze scans acquired in different contrast enhancement phases separately or at least consider the imaging phase during analysis. Furthermore, interobserver variability significantly affects the reproducibility of RFs and must be accounted for in multi-observer radiomics studies. While portal venous phase scans yielded the highest reproducibility within and among radiologists and could be recommended for multi-institutional HCC radiomics studies, biological intent must also be considered when designing such a study.